[57] ABSTRACT

Serial analysis of gene expression (SAGE), a method for the
rapid quantitative and qualitative analysis of transcripts, is
provided. Short defined sequence tags corresponding to
expressed genes are isolated and analyzed. Sequencing of
over 1,000 defined tags in a short period of time (e.g., hours)
reveals a gene expression pattern characteristic of the
function of a cell or tissue. Moreover, SAGE is useful as a gene
discovery tool for the identification and isolation of novel
sequence tags corresponding to novel transcripts and genes.
METHOD FOR SERIAL ANALYSIS OF GENE EXPRESSION

This invention was made with support from National Institutes of
Health Grant Nos. CA57345, CA35494, and GM07309. The Government has
certain rights in this invention.

FIELD OF THE INVENTION

The present invention relates generally to the field of gene
expression and specifically to a method for the serial analysis of
gene expression (SAGE) for the analysis of a large number of
transcripts by identification of a defined region of a transcript
which corresponds to a region of an expressed gene.

BACKGROUND OF THE INVENTION

Determination of the genomic sequence of higher organisms, including
humans, is now a real and attainable goal. However, this analysis
only represents one level of genetic complexity. The ordered and
timely expression of genes represents another level of complexity
equally important to the definition and biology of the organism.

The role of sequencing complementary DNA (cDNA), reverse transcribed
from mRNA, as part of the human genome project has been debated as
proponents of genomic sequencing have argued the difficulty of
finding every mRNA expressed in all tissues, cell types, and
developmental stages and have pointed out that much valuable
information from intronic and intergenic regions, including control
and regulatory sequences, will be missed by cDNA sequencing (Report
of the Committee on Mapping and Sequencing the Human Genome, National
Academy Press, Washington, D.C., 1988). Sequencing of transcribed
regions of the genome using cDNA libraries has heretofore been
considered unsatisfactory. Libraries of cDNA are believed to be
dominated by repetitive elements, mitochondrial genes, ribosomal RNA
genes, and other nuclear genes comprising common or housekeeping
sequences. It is believed that cDNA libraries do not provide all
sequences corresponding to structural and regulatory polypeptides or
peptides (Putney, et al., Nature, 302:718, 1983).

Another drawback of standard cDNA cloning is that some mRNAs are
abundant while others are rare. The cellular quantities of mRNA from
various genes can vary by several orders of magnitude.

Techniques based on cDNA subtraction or differential display can be
quite useful for comparing gene expression differences between two
cell types (Hedrick, et al., Nature, 308:149, 1984; Liang and Pardee,
Science, 257:967, 1992), but provide only a partial analysis, with no
direct information regarding abundance of messenger RNA. The
expressed sequence tag (EST) approach has been shown to be a valuable
tool for gene discovery (Adams, et al., Science, 252:1656, 1991;
Adams, et al., Nature, 355:632, 1992; Okubo et al., Nature Genetics,
2:173, 1992), but like Northern blotting, RNase protection, and
reverse transcriptase-polymerase chain reaction (RT-PCR) analysis
(Alwine, et al., Proc. Natl. Acad. Sci. U.S.A., 74:5350, 1977; Zinn
et al., Cell, 34:865, 1983; Veres, et al., Science, 237:415, 1987),
only evaluates a limited number of genes at a time. In addition, the
EST approach preferably employs nucleotide sequences of 150 base
pairs or longer for similarity searches and mapping.

Sequence tagged sites (STSs) (Olson, et al., Science, 245:1434, 1989)
have also been utilized to identify genomic markers for the physical
mapping of the genome. These short sequences from physically mapped
clones represent uniquely identified map positions in the genome. In
contrast, the identification of expressed genes relies on expressed
sequence tags which are markers for those genes actually transcribed
and expressed in vivo.

There is a need for an improved method which allows rapid, detailed
analysis of thousands of expressed genes for the investigation of a
variety of biological applications, particularly for establishing the
overall pattern of gene expression in different cell types or in the
same cell type under different physiologic or pathologic conditions.
Identification of different patterns of expression has several
utilities, including the identification of appropriate therapeutic
targets, candidate genes for gene therapy (e.g., gene replacement),
tissue typing, forensic identification, mapping locations of
disease-associated genes, and for the identification of diagnostic
and prognostic indicator genes.

SUMMARY OF THE INVENTION

The present invention provides a method for the rapid analysis of
numerous transcripts in order to identify the overall pattern of gene
expression in different cell types or in the same cell type under
different physiologic, developmental or disease conditions. The
method is based on the identification of a short nucleotide sequence
tag at a defined position in a messenger RNA. The tag is used to
identify the corresponding transcript and gene from which it was
transcribed. By utilizing dimerized tags, termed a "ditag", the
method of the invention allows elimination of certain types of bias
which might occur during cloning and/or amplification and possibly
during data evaluation. Concatenation of these short nucleotide
sequence tags allows the efficient analysis of transcripts in a
serial manner by sequencing multiple tags on a single DNA molecule,
for example, a DNA molecule inserted in a vector or in a single
clone.

The method described herein is the serial analysis of gene expression
(SAGE), a novel approach which allows the analysis of a large number
of transcripts. To demonstrate this strategy, short cDNA sequence
tags were generated from mRNA isolated from pancreas, randomly paired
to form ditags, concatenated, and cloned. Manual sequencing of 1,000
tags revealed a gene expression pattern characteristic of pancreatic
function. Identification of such patterns is important diagnostically
and therapeutically, for example. Moreover, the use of SAGE as a gene
discovery tool was documented by the identification and isolation of
new pancreatic transcripts corresponding to novel tags. SAGE provides
a broadly applicable means for the quantitative cataloging and
comparison of expressed genes in a variety of normal, developmental,
and disease states.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a schematic of SAGE. The first restriction enzyme, or
anchoring enzyme, is NlaIII and the second enzyme, or tagging enzyme,
is FokI in this example. Sequences represent primer derived
sequences, and transcript derived sequences with "X" and "O"
representing nucleotides of different tags.

FIG. 2 shows a comparison of transcript abundance. Bars represent the
percent abundance as determined by SAGE (dark bars) or hybridization
analysis (light bars). SAGE quantitations were derived from Table 1
as follows: TRY1/2 includes the tags for trypsinogen 1 and 2, PROCAR
indicates tags for procarboxypeptidase A1, CHYMO indicates tags for
chymotrypsinogen, and ELA/PRO includes the tags for elastase IIIB and
protease E. Error bars represent the standard deviation determined by
taking the square root of counted events and converting it to a
percent abundance (assumed Poisson distribution).

FIG. 3 shows the results of screening a cDNA library with
SAGE tags. P1 and P2 show typical hybridization results
obtained with 13 bp oligonucleotides as described in the
Examples. P1 and P2 correspond to the transcripts described
in Table 2. Images were obtained using a Molecular Dynamics
PhosphorImager and the circle indicates the outline of
the filter membrane to which the recombinant phage were
transferred prior to hybridization.
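
By way of illustration only, the error bars described for FIG. 2 can be sketched as follows. The function name and the example count are hypothetical and are not taken from the figures; the only assumption carried over from the text is that counted events are Poisson distributed, so that the standard deviation of a count is its square root.

```python
import math


def percent_abundance_with_error(count, total):
    """Return (percent abundance, one-standard-deviation error in percent).

    Assumes Poisson-distributed counts, so the standard deviation of a
    count is taken as its square root and then scaled to a percentage.
    """
    percent = 100.0 * count / total
    error = 100.0 * math.sqrt(count) / total
    return percent, error


# Hypothetical example: one tag observed 64 times among 840 analyzed tags.
pct, err = percent_abundance_with_error(64, 840)
print(f"{pct:.2f}% +/- {err:.2f}%")
```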

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention provides a rapid, quantitative process
for determining the abundance and nature of transcripts
corresponding to expressed genes. The method, termed
serial analysis of gene expression (SAGE), is based on the
identification of and characterization of partial, defined
sequences of transcripts corresponding to gene segments.
These defined transcript sequence "tags" are markers for
genes which are expressed in a cell, a tissue, or an extract,
for example.

SAGE is based on several principles. First, a short nucleotide
sequence tag (9 to 10 bp) contains sufficient information
content to uniquely identify a transcript provided it is
isolated from a defined position within the transcript. For
example, a sequence as short as 9 bp can distinguish
262,144 transcripts (4^9) given a random nucleotide distribution
at the tag site, whereas estimates suggest that the human
genome encodes about 80,000 to 200,000 transcripts (Fields,
et al., Nature Genetics, 7:345, 1994). The size of the tag can
be shorter for lower eukaryotes or prokaryotes, for example,
where the number of transcripts encoded by the genome is
lower. For example, a tag as short as 6-7 bp may be
sufficient for distinguishing transcripts in yeast.
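
The arithmetic behind the first principle can be checked with a short sketch. The function names are hypothetical. The calculation assumes a random nucleotide distribution at the tag site, as the text does, and therefore gives only a lower bound on the tag length needed in practice.

```python
def distinguishable_tags(tag_length_bp):
    """Number of distinct sequences of the given length over A, C, G and T."""
    return 4 ** tag_length_bp


def min_tag_length(n_transcripts):
    """Smallest tag length whose sequence space is at least n_transcripts."""
    length = 1
    while 4 ** length < n_transcripts:
        length += 1
    return length


for n in (6, 7, 9, 10, 11):
    print(n, "bp ->", distinguishable_tags(n), "possible tags")

# Transcript estimates quoted in the text for the human genome.
print(min_tag_length(80_000), min_tag_length(200_000))
```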

Second, random dimerization of tags allows a procedure
for reducing bias (caused by amplification and/or cloning).
Third, concatenation of these short sequence tags allows the
efficient analysis of transcripts in a serial manner by
sequencing multiple tags within a single vector or clone. As
with serial communication by computers, wherein information
is transmitted as a continuous string of data, serial
analysis of the sequence tags requires a means to establish
the register and boundaries of each tag. All of these
principles may be applied independently, in combination, or in
combination with other known methods of sequence
identification.

In a first embodiment, the invention provides a method for
the detection of gene expression in a particular cell or tissue,
or cell extract, for example, including at a particular
developmental stage or in a particular disease state. The method
comprises producing complementary deoxyribonucleic acid
(cDNA) oligonucleotides, isolating a first defined nucleotide
sequence tag from a first cDNA oligonucleotide and a
second defined nucleotide sequence tag from a second
cDNA oligonucleotide, linking the first tag to a first
oligonucleotide linker, wherein the first oligonucleotide linker
comprises a first sequence for hybridization of an
amplification primer and linking the second tag to a second
oligonucleotide linker, wherein the second oligonucleotide linker
comprises a second sequence for hybridization of an
amplification primer, and determining the nucleotide sequence of
the tag(s), wherein the tag(s) correspond to an expressed
gene.

FIG. 1 shows a schematic representation of the analysis of
messenger RNA (mRNA) using SAGE as described in the
method of the invention. mRNA is isolated from a cell or
tissue of interest for in vitro synthesis of a double-stranded
DNA sequence by reverse transcription of the mRNA. The
double-stranded DNA complement of mRNA formed is
referred to as complementary DNA (cDNA).

The term "oligonucleotide" as used herein refers to prim- 
ers or oligomer fragments comprised of two or more deox- 
yribonucleotides or ribonucleotides, preferably more than 
three. The exact size will depend on many factors, which in 
turn depend on the ultimate function or use of the oligo- 
nucleotide. 

The method further includes ligating the first tag linked to
the first oligonucleotide linker to the second tag linked to the 
second oligonucleotide linker and forming a "ditag". Each 
ditag represents two defined nucleotide sequences of at least 
one transcript, representative of at least one gene. Typically, 
a ditag represents two transcripts from two distinct genes. 
The presence of a defined cDNA tag within the ditag is 
indicative of expression of a gene having a sequence of that 
tag. 

The analysis of ditags, formed prior to any amplification
step, provides a means to eliminate potential distortions
introduced by amplification, e.g., PCR. The pairing of tags
for the formation of ditags is a random event. Because the
number of different tags is expected to be large, the
probability of any two tags being coupled in the same ditag
is small, even for abundant transcripts. Repeated ditags,
potentially produced by biased standard amplification
and/or cloning methods, are therefore excluded from analysis
by the method of the invention.
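
A minimal sketch of this reasoning follows. The function names and the toy 9 bp tags are hypothetical, and treating a ditag as the same molecule regardless of which tag is read first is an assumption of the sketch. It shows how small the chance of a given random pairing is, and how repeated ditags can be collapsed before tags are counted.

```python
from collections import Counter


def ditag_pair_probability(p_a, p_b):
    """Chance that one randomly formed ditag joins two different tags
    whose fractional abundances are p_a and p_b (either orientation)."""
    return 2.0 * p_a * p_b


def count_tags(ditags):
    """Count tags after collapsing repeated ditags.

    Each ditag is a (tag_a, tag_b) pair. A ditag seen more than once is
    treated as a possible amplification duplicate and counted only once.
    """
    unique_ditags = {tuple(sorted(pair)) for pair in ditags}
    counts = Counter()
    for tag_a, tag_b in unique_ditags:
        counts[tag_a] += 1
        counts[tag_b] += 1
    return counts


# Two tags at 5% abundance each still pair only rarely.
pair_chance = ditag_pair_probability(0.05, 0.05)

# Hypothetical ditags; the second is the first read in the other orientation.
observed = [
    ("AAACCCGGG", "TTTGGGCCC"),
    ("TTTGGGCCC", "AAACCCGGG"),
    ("AAACCCGGG", "GATTACAGA"),
]
tag_counts = count_tags(observed)
print(pair_chance, dict(tag_counts))
```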

The term "defined" nucleotide sequence, or "defined"
nucleotide sequence tag, refers to a nucleotide sequence
derived from either the 5' or 3' terminus of a transcript. The
sequence is defined by cleavage with a first restriction
endonuclease, and represents nucleotides either 5' or 3' of
the first restriction endonuclease site, depending on which
terminus is used for capture (e.g., 3' when oligo-dT is used
for capture as described herein).

As used herein, the terms "restriction endonucleases" and 
"restriction enzymes" refer to bacterial enzymes which bind 
to a specific double-stranded DNA sequence termed a rec- 
ognition site or recognition nucleotide sequence, and cut 
double-stranded DNA at or near the specific recognition site. 

The first endonuclease, termed "anchoring enzyme" or
"AE" in FIG. 1, is selected by its ability to cleave a transcript
at least one time and therefore produce a defined sequence
tag from either the 5' or 3' end of a transcript. Preferably, a
restriction endonuclease having at least one recognition site
and therefore having the ability to cleave a majority of
cDNAs is utilized. For example, as illustrated herein,
enzymes which have a 4 base pair recognition site are
expected to cleave every 256 base pairs (4^4) on average
while most transcripts are considerably larger. Restriction
endonucleases which recognize a 4 base pair site include
NlaIII, as exemplified in the EXAMPLES of the present
invention. Other similar endonucleases having at least one
recognition site within a DNA molecule (e.g., cDNA) will be
known to those of skill in the art (see for example, Current
Protocols in Molecular Biology, Vol. 2, 1995, Ed. Ausubel,
et al., Greene Publish. Assoc. & Wiley Interscience, Unit
3.1.15; New England Biolabs Catalog, 1995).
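
The choice of anchoring enzyme can be illustrated with a rough calculation. The function names and the 1,500 bp transcript length are hypothetical. The sketch assumes a random sequence with equal base frequencies and treats each position as independent, which is an approximation only.

```python
def expected_site_spacing(site_length_bp):
    """Average distance between occurrences of a specific site in a
    random sequence with equal base frequencies."""
    return 4 ** site_length_bp


def prob_at_least_one_site(transcript_length_bp, site_length_bp):
    """Approximate chance that a transcript contains the site at least once,
    treating every start position as an independent trial."""
    positions = max(transcript_length_bp - site_length_bp + 1, 0)
    p_site = 1.0 / 4 ** site_length_bp
    return 1.0 - (1.0 - p_site) ** positions


# A 4 bp cutter is expected to cleave nearly every transcript of this
# hypothetical length, while an 8 bp cutter rarely does.
print(expected_site_spacing(4))
print(round(prob_at_least_one_site(1500, 4), 3))
print(round(prob_at_least_one_site(1500, 8), 3))
```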

After cleavage with the anchoring enzyme, the most 5' or 3' region
of the cleaved cDNA can then be isolated by binding to a capture
medium. For example, as illustrated in the present EXAMPLES,
streptavidin beads are used to isolate the defined 3' nucleotide
sequence tag when the oligo dT primer for cDNA synthesis is
biotinylated. In this example, cleavage with the first or anchoring
enzyme provides a unique site on each transcript which corresponds to
the restriction site located closest to the poly-A tail. Likewise,
the 5' cap of a transcript (the cDNA) can be utilized for labeling or
binding a capture means for isolation of a 5' defined nucleotide
sequence tag. Those of skill in the art will know other similar
capture systems (e.g., biotin/streptavidin,
digoxigenin/anti-digoxigenin) for isolation of the defined sequence
tag as described herein.

The invention is not limited to use of a single "anchoring" or first
restriction endonuclease. It may be desirable to perform the method
of the invention sequentially, using different enzymes on separate
samples of a preparation, in order to identify a complete pattern of
transcription for a cell or tissue. In addition, the use of more than
one anchoring enzyme provides confirmation of the expression pattern
obtained from the first anchoring enzyme. Therefore, it is also
envisioned that the first or anchoring endonuclease may rarely cut
cDNA such that few or no cDNA representing abundant transcripts are
cleaved. Thus, transcripts which are cleaved represent "unique"
transcripts. Restriction enzymes that have a 7-8 bp recognition site,
for example, would be enzymes that would rarely cut cDNA. Similarly,
more than one tagging enzyme, described below, can be utilized in
order to identify a complete pattern of transcription.

The term "isolated" as used herein includes polynucleotides
substantially free of other nucleic acids, proteins, lipids,
carbohydrates or other materials with which it is naturally
associated. cDNA is not naturally occurring as such, but rather is
obtained via manipulation of a partially purified naturally occurring
mRNA. Isolation of a defined sequence tag refers to the purification
of the 5' or 3' tag from other cleaved cDNA.

In one embodiment, the isolated defined nucleotide sequence tags are
separated into two pools of cDNA, when the linkers have different
sequences. Each pool is ligated via the anchoring, or first
restriction endonuclease site to one of two linkers. When the linkers
have the same sequence, it is not necessary to separate the tags into
pools. The first oligonucleotide linker comprises a first sequence
for hybridization of an amplification primer and the second
oligonucleotide linker comprises a second sequence for hybridization
of an amplification primer. In addition, the linkers further comprise
a second restriction endonuclease site, also termed the "tagging
enzyme" or "TE". The method of the invention does not require, but
preferably comprises amplifying the ditag oligonucleotide after
ligation.

The second restriction endonuclease cleaves at a site distant from or
outside of the recognition site. For example, the second restriction
endonuclease can be a type IIS restriction enzyme. Type IIS
restriction endonucleases cleave at a defined distance up to 20 bp
away from their asymmetric recognition sites (Szybalski, W., Gene,
40:169, 1985). Examples of type IIS restriction endonucleases include
BsmFI and FokI. Other similar enzymes will be known to those of skill
in the art (see, Current Protocols in Molecular Biology, supra).

The first and second "linkers" which are ligated to the defined
nucleotide sequence tags are oligonucleotides having the same or
different nucleotide sequences. For example, the linkers illustrated
in the Examples of the present invention include linkers having
different sequences:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3' (SEQ ID NO:1)
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5' (SEQ ID NO:2) and
5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3' (SEQ ID NO:3)
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5' (SEQ ID NO:4),

wherein A is a dideoxy nucleotide (e.g., dideoxy A). Other similar
linkers can be utilized in the method of the invention; those of
skill in the art can design such alternate linkers.

The linkers are designed so that cleavage of the ligation products
with the second restriction enzyme, or tagging enzyme, results in
release of the linker having a defined nucleotide sequence tag (e.g.,
3' of the restriction endonuclease cleavage site as exemplified
herein). The defined nucleotide sequence tag may be from about 6 to
30 base pairs. Preferably, the tag is about 9 to 11 base pairs.
Therefore, a ditag is from about 12 to 60 base pairs, and preferably
from 18 to 22 base pairs.

The pool of defined tags ligated to linkers having the same sequence,
or the two pools of defined nucleotide sequence tags ligated to
linkers having different nucleotide sequences, are randomly ligated
to each other "tail to tail". The portion of the cDNA tag furthest
from the linker is referred to as the "tail". As illustrated in FIG.
1, the ligated tag pair, or ditag, has a first restriction
endonuclease site upstream (5') and a first restriction endonuclease
site downstream (3') of the ditag; a second restriction endonuclease
cleavage site upstream and downstream of the ditag, and a linker
oligonucleotide containing both a second restriction enzyme
recognition site and an amplification primer hybridization site
upstream and downstream of the ditag. In other words, the ditag is
flanked by the first restriction endonuclease site, the second
restriction endonuclease cleavage site and the linkers, respectively.

The ditag can be amplified by utilizing primers which specifically
hybridize to one strand of each linker. Preferably, the amplification
is performed by standard polymerase chain reaction (PCR) methods as
described (U.S. Pat. No. 4,683,195). Alternatively, the ditags can be
amplified by cloning in procaryotic-compatible vectors or by other
amplification methods known to those of skill in the art.

The term "primer" as used herein refers to an oligonucleotide,
whether occurring naturally or produced synthetically, which is
capable of acting as a point of initiation of synthesis when placed
under conditions in which synthesis of primer extension product which
is complementary to a nucleic acid strand is induced, i.e., in the
presence of nucleotides and an agent for polymerization such as DNA
polymerase and at a suitable temperature and pH. The primer is
preferably single stranded for maximum efficiency in amplification.
Preferably, the primer is an oligodeoxyribonucleotide. The primer
must be sufficiently long to prime the synthesis of extension
products in the presence of the agent for polymerization. The exact
lengths of the primers will depend on many factors, including
temperature and source of primer.

The primers herein are selected to be "substantially" complementary
to the different strands of each specific sequence to be amplified.
This means that the primers must be sufficiently complementary to
hybridize with their respective strands. Therefore, the primer
sequence need not reflect the exact sequence of the template. In the
present invention, the primers are substantially complementary to the
oligonucleotide linkers.

Primers useful for amplification of the linkers exemplified herein as
SEQ ID NO:1-4 include 5'-CCAGCTTATTCAATTCGGTCC-3' (SEQ ID NO:5) and
5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:6). Those of skill in the art
can prepare similar primers for amplification based on the nucleotide
sequence of the linkers without undue experimentation.

Cleavage of the amplified PCR product with the first restriction
endonuclease allows isolation of ditags which can be concatenated by
ligation. After ligation, it may be desirable to clone the
concatemers, although it is not required in the method of the
invention. Analysis of the ditags or concatemers, whether or not
amplification was performed, is by standard sequencing methods.
Concatemers generally consist of about 2 to 200 ditags and preferably
from about 8 to 20 ditags. While these are preferred concatemers, it
will be apparent that the number of ditags which can be concatenated
will depend on the length of the individual tags and can be readily
determined by those of skill in the art without undue
experimentation. After formation of concatemers, multiple tags can be
cloned into a vector for sequence analysis, or alternatively, ditags
or concatemers can be directly sequenced without cloning by methods
known to those of skill in the art.

Among the standard procedures for cloning the defined nucleotide
sequence tags of the invention is insertion of the tags into vectors
such as plasmids or phage. The ditag or concatemers of ditags
produced by the method described herein are cloned into recombinant
vectors for further analysis, e.g., sequence analysis,
plaque/plasmid hybridization using the tags as probes, by methods
known to those of skill in the art.

The term "recombinant vector" refers to a plasmid, virus or other
vehicle known in the art that has been manipulated by insertion or
incorporation of the ditag genetic sequences. Such vectors contain a
promoter sequence which facilitates the efficient transcription of a
marker genetic sequence, for example. The vector typically contains
an origin of replication, a promoter, as well as specific genes which
allow phenotypic selection of the transformed cells. Vectors suitable
for use in the present invention include, for example, pBlueScript
(Stratagene, La Jolla, Calif.); pBC, pSL301 (Invitrogen) and other
similar vectors known to those of skill in the art. Preferably, the
ditags or concatemers thereof are ligated into a vector for
sequencing purposes.

Vectors in which the ditags are cloned can be transferred into a
suitable host cell. "Host cells" are cells in which a vector can be
propagated and its DNA expressed. The term also includes any progeny
of the subject host cell. It is understood that all progeny may not
be identical to the parental cell since there may be mutations that
occur during replication. However, such progeny are included when the
term "host cell" is used. Methods of stable transfer, meaning that
the foreign DNA is continuously maintained in the host, are known in
the art.

Transformation of a host cell with a vector containing ditag(s) may
be carried out by conventional techniques as are well known to those
skilled in the art. Where the host is prokaryotic, such as E. coli,
competent cells which are capable of DNA uptake can be prepared from
cells harvested after exponential growth phase and subsequently
treated by the CaCl2 method using procedures well known in the art.
Alternatively, MgCl2 or RbCl can be used. Transformation can also be
performed by electroporation or other commonly used methods in the
art.

The ditags present in a particular clone can be sequenced by standard
methods (see for example, Current Protocols in Molecular Biology,
supra, Unit 7) either manually or using automated methods.

In another embodiment, the present invention provides a kit used for
detection of gene expression wherein the presence of a defined
nucleotide tag or ditag is indicative of expression of a gene having
a sequence of the tag, the kit comprising one or more containers
comprising a first container containing a first oligonucleotide
linker having a first sequence useful for hybridization of an
amplification primer; a second container containing a second
oligonucleotide linker having a second sequence useful for
hybridization of an amplification primer, wherein the linkers further
comprise a restriction endonuclease site for cleavage of DNA at a
site distant from the restriction endonuclease recognition site; and
a third and fourth container having nucleic acid primers for
hybridization to the first and second unique sequence of the linker.
It is apparent that if the oligonucleotide linkers comprise the same
nucleotide sequence, only one container containing linkers is
necessary in the kit of the invention.

In yet another embodiment, the invention provides an oligonucleotide
composition having at least two defined nucleotide sequence tags,
wherein at least one of the sequence tags corresponds to at least one
expressed gene. The composition consists of about 1 to 200 ditags,
and preferably about 8 to 20 ditags. Such compositions are useful for
the analysis of gene expression by identifying the defined nucleotide
sequence tag corresponding to an expressed gene in a cell, tissue or
cell extract, for example.

The following examples are intended to illustrate but not limit the
invention. While they are typical of those that might be used, other
procedures known to those skilled in the art may alternatively be
used.

EXAMPLES

For exemplary purposes, the SAGE method of the invention was used to
characterize gene expression in the human pancreas. NlaIII was
utilized as the first restriction endonuclease, or anchoring enzyme,
and BsmFI as the second restriction endonuclease, or tagging enzyme,
yielding a 9 bp tag (BsmFI was predicted to cleave the complementary
strand 14 bp 3' to the recognition site GGGAC and to yield a 4 bp 5'
overhang (New England BioLabs)). Overlapping the BsmFI and NlaIII
(CATG) sites as indicated (GGGACATG) would be predicted to result in
a 11 bp tag. However, analysis suggested that under the cleavage
conditions used (37° C.), BsmFI often cleaved closer to its
recognition site leaving a minimum of 12 bp 3' of its recognition
site. Therefore, only the 9 bp closest to the anchoring enzyme site
was used for analysis of tags. Cleavage at 65° C. results in a more
consistent 11 bp tag.

Computer analysis of human transcripts from GenBank indicated that
greater than 95% of tags of 9 bp in length were likely to be unique
and that inclusion of two additional bases provided little additional
resolution. Human sequences (84,300) were extracted from the GenBank
87 database using the Findseq program provided on the IntelliGenetics
Bionet on-line service. All further analysis was performed with a
SAGE program group written in Microsoft Visual Basic for the
Microsoft Windows operating system. The SAGE database analysis
program was set to include only sequences noted as "RNA" in the locus
description and to exclude entries noted as "EST", resulting in a
reduction to 13,241 sequences. Analysis of this subset of sequences
using NlaIII as anchoring enzyme indicated that 4,127 nine bp tags
were unique while 1,511 tags were found in more than one entry.
Nucleotide comparison of a randomly chosen subset (100) of the latter
entries indicated that at least 83% were due to redundant data base
entries for the same gene or highly related genes (>95% identity over
at least 250 bp). This suggested that 5381 of the 9 bp tags (95.5%)
were unique to a transcript or highly conserved transcript family.
Likewise, analysis of the same subset of GenBank with an 11 bp tag
resulted only in a 6% decrease in repeated tags (1511 to 1425)
instead of the 94% decrease expected if the repeated tags were due to
unrelated transcripts.
Example 1 

As outlined above, mRNA from human pancreas was
used to generate ditags. Briefly, five μg mRNA from total
pancreas (Clontech) was converted to double stranded
cDNA using a BRL cDNA synthesis kit following the
manufacturer's protocol, using the primer biotin-5'T18-3'.
The cDNA was then cleaved with NlaIII and the 3' restriction
fragments isolated by binding to magnetic streptavidin
beads (Dynal). The bound DNA was divided into two pools,
and one of the following linkers ligated to each pool:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
   3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5' (SEQ ID NO:1 and 2)

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
   3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5' (SEQ ID NO:3 and 4),
where A is a dideoxy nucleotide (e.g., dideoxy A).
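
As a consistency check on the linker listing, the two strands of each linker can be tested for complementarity. The sequences below are transcribed from SEQ ID NO:1-4 as best they can be read from this text and should be treated as assumptions; the function name is hypothetical. The sketch reports the single-stranded 3' end left on the top strand, which is expected to match the NlaIII overhang (CATG), and checks that each top strand ends with the overlapping BsmFI and NlaIII sites (GGGACATG).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def top_strand_3prime_overhang(top_5to3, bottom_3to5):
    """Return the unpaired 3' end of the top strand of a linker duplex.

    bottom_3to5 is written 3'->5' (as printed in the listing), so its
    base-by-base complement reads 5'->3' along the top strand.
    """
    paired = bottom_3to5.translate(COMPLEMENT)
    start = top_5to3.find(paired)
    if start == -1:
        raise ValueError("strands are not complementary")
    return top_5to3[start + len(paired):]


linker1_top = "TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG"      # SEQ ID NO:1
linker1_bottom = "ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT"          # SEQ ID NO:2, 3'->5'
linker2_top = "TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG"     # SEQ ID NO:3
linker2_bottom = "AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT"         # SEQ ID NO:4, 3'->5'

overhang1 = top_strand_3prime_overhang(linker1_top, linker1_bottom)
overhang2 = top_strand_3prime_overhang(linker2_top, linker2_bottom)
print(overhang1, overhang2)
print(linker1_top.endswith("GGGACATG"), linker2_top.endswith("GGGACATG"))
```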

After extensive washing to remove unligated linkers, the
linkers and adjacent tags were released by cleavage with
BsmFI. The resulting overhangs were filled in with T4
polymerase and the pools combined and ligated to each
other. The desired ligation product was then amplified for 25
cycles using 5'-CCAGCTTATTCAATTCGGTCC-3' and
5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:5 and
6, respectively) as primers. The PCR reaction was then
analyzed by polyacrylamide gel electrophoresis and the
desired product excised. An additional 15 cycles of PCR
were then performed to generate sufficient product for
efficient ligation and cloning.
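
The tag and ditag steps described above can be mimicked in software. The sketch below is an illustration under stated assumptions, not the procedure itself: it takes the 9 bases immediately 3' of the 3'-most CATG site in a cDNA as the tag, and models a ditag as one tag joined tail to tail with the reverse complement of another. The two cDNA strings are invented for the example; only the tags they yield appear in Table 1.

```python
from typing import Optional

ANCHOR = "CATG"   # NlaIII anchoring enzyme site
TAG_LEN = 9       # tag length used in this example

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def sage_tag(cdna: str) -> Optional[str]:
    """Return the TAG_LEN bases 3' of the 3'-most anchoring site, if any."""
    pos = cdna.rfind(ANCHOR)
    if pos < 0:
        return None
    tag = cdna[pos + len(ANCHOR): pos + len(ANCHOR) + TAG_LEN]
    return tag if len(tag) == TAG_LEN else None


def make_ditag(tag_a: str, tag_b: str) -> str:
    """Join two tags tail to tail (second tag in opposite orientation)."""
    return tag_a + revcomp(tag_b)


if __name__ == "__main__":
    # Hypothetical cDNA fragments, written 5'->3' on the sense strand.
    cdna_a = "ACGTCATGGAGCACACCTTTGGGAAAAAAAA"
    cdna_b = "TTCATGCCCATGTTCTGTGTGAACCAAAAAAA"
    tag_a, tag_b = sage_tag(cdna_a), sage_tag(cdna_b)
    print(tag_a, tag_b, make_ditag(tag_a, tag_b))
```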

The PCR ditag products were cleaved with NlaIII and the
band containing the ditags was excised and self-ligated.
After ligation, the concatenated ditags were separated by
polyacrylamide gel electrophoresis and products greater than
200 bp were excised. These products were cloned into the
SphI site of pSL301 (Invitrogen). Colonies were screened
for inserts by PCR using T7 and T3 sequences outside the
cloning site as primers. Clones containing at least 10 tags
(range 10 to 50 tags) were identified by PCR amplification
and manually sequenced as described (Del Sal, et al.,
Biotechniques 7:514, 1989) using
5'-GACGTCGACCTGAGGTAATTATAACC-3' (SEQ ID
NO:7) as primer. Sequence files were analyzed using the
SAGE software group, which identifies the anchoring
enzyme site with the proper spacing, extracts the two
intervening tags, and records them in a database. The 1,000
tags were derived from 413 unique ditags and 87 repeated
ditags. The latter were only counted once to eliminate
potential PCR bias of the quantitation. The function of
SAGE software is merely to optimize the search for gene
sequences.
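
The kind of parsing attributed to the SAGE software group can be sketched as follows. This is not the software described in the text; it is a minimal Python illustration that assumes ditags are the stretches of 18 to 22 unambiguous bases lying between consecutive CATG sites, takes 9 bases from each end as the two tags, and counts a repeated ditag only once. The spacing window, the function names, and the example read are all assumptions.

```python
from collections import Counter
from typing import Iterable, List

ANCHOR = "CATG"                  # NlaIII site separating ditags
TAG_LEN = 9
MIN_DITAG, MAX_DITAG = 18, 22    # assumed acceptable spacing between sites

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def extract_ditags(read: str) -> List[str]:
    """Return sequences bounded by two anchoring sites with proper spacing."""
    pieces = read.split(ANCHOR)
    inner = pieces[1:-1]         # first and last pieces are flanking sequence
    return [p for p in inner
            if MIN_DITAG <= len(p) <= MAX_DITAG and set(p) <= set("ACGT")]


def count_tags(reads: Iterable[str]) -> Counter:
    """Tally tags, counting each distinct ditag once to limit PCR bias."""
    seen = set()
    counts: Counter = Counter()
    for read in reads:
        for ditag in extract_ditags(read):
            key = min(ditag, revcomp(ditag))   # orientation-independent key
            if key in seen:
                continue
            seen.add(key)
            counts[ditag[:TAG_LEN]] += 1           # tag read from one end
            counts[revcomp(ditag)[:TAG_LEN]] += 1  # tag read from the other
    return counts


if __name__ == "__main__":
    read = ("GGATCC" + ANCHOR + "GAGCACACCCACACAGAA" + ANCHOR
            + "GAACACAAATCACCCTGA" + ANCHOR + "GAGCACACCCACACAGAA"
            + ANCHOR + "AAGCTT")
    print(count_tags([read]))
```

In the example read the first ditag occurs twice, so its two tags are each recorded once, matching the rule that repeated ditags are counted a single time.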

Table 1 shows analysis of the first 1,000 tags. Sixteen
percent were eliminated because they either had sequence
ambiguities or were derived from linker sequences. The
remaining 840 tags included 351 tags that occurred once and
77 tags that were found multiple times. Nine of the ten most
abundant tags matched at least one entry in GenBank R87.
The remaining tag was subsequently shown to be derived
from amylase. All ten transcripts were derived from genes of
known pancreatic function and their prevalence was consistent
with previous analyses of pancreatic RNA using conventional
approaches (Han, et al., Proc. Natl. Acad. Sci.
U.S.A. 83:110, 1986; Takeda, et al., Hum. Mol. Gen., 2:
1793, 1993).

TABLE 1

Pancreatic SAGE Tags

TAG         Gene                                            N   Percent

GAGCACACC   Procarboxypeptidase A1 (X67318)                 64    7.6
TTCTGTGTG   Pancreatic Trypsinogen 2 (M27602)               46    5.5
GAACACAAA   Chymotrypsinogen (M24400)                       37    4.4
TCAGGGTGA   Pancreatic Trypsin 1 (M22612)                   31    3.7
GCGTGACCA   Elastase IIIB (M18692)                          20    2.4
GTGTGTGCT   Protease E (D00306)                             16    1.9
TCATTGGCC   Pancreatic Lipase (M93235)                      16    1.9
CCAGAGAGT   Procarboxypeptidase B (M81057)                  14    1.7
TCCTCAAAA   No Match, See Table 2, P1                       14    1.7
AGCCTTGGT   Bile Salt Stimulated Lipase (X54457)            12    1.4
GTGTGCGCT   No Match                                        11    1.3
TGCGAGACC   No Match, See Table 2, P2                        9    1.1
GTGAAACCC   21 Alu entries                                   8    1.0
GGTGACTCT   No Match                                         8    1.0
AAGGTAACA   Secretory Trypsin Inhibitor (M11949)             6    0.7
TCCCCTGTG   No Match                                         5    0.6
GTGACCACG   No Match                                         5    0.6
CCTGTAATC   M91159, M29366, 11 Alu entries                   5    0.6
CACGTTGGA   No Match                                         5    0.6
AGCCCTACA   No Match                                         5    0.6
AGCACCTCC   Elongation Factor 2 (Z11692)                     5    0.6
ACGCAGGGA   No Match, See Table 2, P3                        5    0.6
AATTGAAGA   No Match, See Table 2, P4                        5    0.6
TTCTGTGGG   No Match                                         4    0.5
TTCATACAC   No Match                                         4    0.5
GTGGCAGGC   NF-kB (X61499), Alu entry (S94541)               4    0.5
GTAAAACCC   TNF receptor II (M55994), Alu entry (X01448)     4    0.5
GAACACACA   No Match                                         4    0.5
CCTGGGAAG   Pancreatic Mucin (J05582)                        4    0.5
CCCATCGTC   Mitochondrial CytC Oxidase (X15759)              4    0.5
(SEQ ID NO:8-37)

Summary
SAGE tags occurring                                         N   Percent

Greater than three times                                   380   45.2
Three times (15 x 3 =)                                      45    5.4
Two times (32 x 2 =)                                        64    7.6
One time                                                   351   41.8

Total SAGE Tags                                            840  100.0

"Tag" indicates the 9 bp sequence unique to each tag,
adjacent to the 4 bp anchoring NlaIII site. "N" and "Percent"
indicate the number of times the tag was identified and its
frequency, respectively. "Gene" indicates the accession
number and description of GenBank R87 entries found to
match the indicated tag using the SAGE software group, with
the following exceptions. When multiple entries were identified
because of duplicated entries, only one entry is listed.
In the cases of chymotrypsinogen and trypsinogen 1, other
genes were identified that were predicted to contain the same
tags, but subsequent hybridization and sequence analysis
identified the listed genes as the source of the tags. "Alu
entry" indicates a match with a GenBank entry for a transcript
that contained at least one copy of the Alu consensus
sequence (Deininger, et al., J. Mol. Biol., 151:17, 1981).
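
The summary rows of Table 1 can be recomputed from the individual counts. The Python sketch below uses only numbers given in the table (the 30 listed counts, plus 15 tags seen three times, 32 seen twice and 351 seen once); the variable and function names are illustrative.

```python
# N values of the 30 tags listed in Table 1, in table order.
TOP_COUNTS = [64, 46, 37, 31, 20, 16, 16, 14, 14, 12, 11, 9, 8, 8, 6] \
    + [5] * 8 + [4] * 7

# Number of distinct tags seen exactly three times, twice, and once.
N_THREE, N_TWO, N_ONE = 15, 32, 351


def summarize():
    """Return (total tags, {row label: (tag count, percent of total)})."""
    rows = {
        "Greater than three times": sum(TOP_COUNTS),
        "Three times": 3 * N_THREE,
        "Two times": 2 * N_TWO,
        "One time": N_ONE,
    }
    total = sum(rows.values())
    table = {label: (n, round(100 * n / total, 1)) for label, n in rows.items()}
    return total, table


if __name__ == "__main__":
    total, table = summarize()
    for label, (n, pct) in table.items():
        print(f"{label:<26}{n:>5}{pct:>7.1f}")
    print(f"{'Total SAGE Tags':<26}{total:>5}")
    # Distinct tags found more than once: listed tags plus triples and doubles.
    print("tags found multiple times:", len(TOP_COUNTS) + N_THREE + N_TWO)
```

The totals agree with the text: 840 tags overall and 77 distinct tags found more than once.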

Example 2

The quantitative nature of SAGE was evaluated by construction
of an oligo-dT primed pancreatic cDNA library
which was screened with cDNA probes for trypsinogen 1/2,
procarboxypeptidase A1, chymotrypsinogen and elastase
IIIB/protease E. Pancreatic mRNA from the same preparation
as used for SAGE in Example 1 was used to construct
a cDNA library in the ZAP Express vector using the ZAP
Express cDNA Synthesis kit following the manufacturer's
protocol (Stratagene). Analysis of 15 randomly selected
clones indicated that 100% contained cDNA inserts. Plates
containing 250 to 500 plaques were hybridized as previously
described (Ruppert, et al., Mol. Cell. Biol. 8:3104, 1988).
cDNA probes for trypsinogen 1, trypsinogen 2, procarboxypeptidase
A1, chymotrypsinogen, and elastase IIIB were
derived by RT-PCR from pancreas RNA. The trypsinogen 1
and 2 probes were 93% identical and hybridized to the same
plaques under the conditions used. Likewise, the elastase
IIIB probe and protease E probe were over 95% identical
and hybridized to the same plaques.

The relative abundance of the SAGE tags for these
transcripts was in excellent agreement with the results
obtained with library screening (FIG. 2). Furthermore,
whereas neither trypsinogen 1 and 2 nor elastase IIIB and
protease E could be distinguished by the cDNA probes used
to screen the library, all four transcripts could readily be
distinguished on the basis of their SAGE tags (Table 1).

Example 3

In addition to providing quantitative information on the
abundance of known transcripts, SAGE could be used to
identify novel expressed genes. While for the purposes of
the SAGE analysis in this example, only the 9 bp sequence
unique to each transcript was considered, each SAGE tag
defined a 13 bp sequence composed of the anchoring
enzyme (4 bp) site plus the 9 bp tag. To illustrate this
potential, 13 bp oligonucleotides were used to isolate the
transcripts corresponding to four unassigned tags (P1 to P4),
that is, tags without corresponding entries from GenBank
R87 (Table 1). In each of the four cases, it was possible to
isolate multiple cDNA clones for the tag by simply screening
the pancreatic cDNA library using a 13 bp oligonucleotide
as hybridization probe (examples in FIG. 3).

Plates containing 250 to 2,000 plaques were hybridized to
oligonucleotide probes using the same conditions previously
described for standard probes except that the hybridization
temperature was reduced to room temperature. Washes were
performed in 6xSSC/0.1% SDS for 30 minutes at room
temperature. The probes consisted of 13 bp oligonucleotides
which were labeled with γ-32P-ATP using T4 polynucleotide
kinase. In each case, sequencing of the derived clones
identified the correct SAGE tag at the predicted 3' end of the
identified transcript. The abundance of plaques identified by
hybridization with the 13-mers was in good agreement with
that predicted by SAGE (Table 2). Tags P1 and P2 were
found to correspond to amylase and preprocarboxypeptidase
A2, respectively. No entry for preprocarboxypeptidase A2
and only a truncated entry for amylase was present in
GenBank R87, thus accounting for their unassigned
characterization. Tag P3 did not match any genes of known
function in GenBank but did match numerous EST's,
providing further evidence that it represented a bona fide
transcript. The cDNA identified by P4 showed no significant
homology, suggesting that it represented a previously
uncharacterized pancreatic transcript.

TABLE 2

Characterization of Unassigned SAGE Tags

                                SAGE                        SAGE
     TAG                        Abundance  13mer Hyb        Tag   Description

P1   TCCTCAAAA (SEQ ID NO:38)   1.7%       1.5% (6/388)     +     3' end of Pancreatic Amylase (M28443)
P2   TGCGAGACC (SEQ ID NO:39)   1.1%       1.2% (43/3700)   +     3' end of Preprocarboxypeptidase A2 (U19977)
P3   ACGCAGGGA (SEQ ID NO:40)   0.6%       0.2% (5/2772)    +     EST match (R45808)
P4   AATTGAAGA (SEQ ID NO:41)   0.6%       0.4% (6/1587)    +     no match

"Tag" and "SAGE Abundance" are described in Table 1;
"13mer Hyb" indicates the results obtained by screening a
cDNA library with a 13mer, as described above. The number
of positive plaques divided by the total plaques screened is
indicated in parentheses following the percent abundance. A
positive in the "SAGE Tag" column indicates that the
expected SAGE tag sequence was identified near the 3' end
of isolated clones. "Description" indicates the results of
BLAST searches of the daily updated GenBank entries at
NCBI as of 6/9/95 (Altschul, et al., J. Mol. Biol., 215:403,
1990). A description and Accession number are given for the
most significant matches. P1 was found to match a truncated
entry for amylase, and P2 was found to match an unpublished
entry for preprocarboxypeptidase A2 which was
entered after GenBank R87.
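
The 13 bp probes and the hybridization percentages of Table 2 can be reproduced with a few lines of Python. The sketch assumes the probe is simply the NlaIII site (CATG) followed by the 9 bp tag, as the description of the 13 bp sequence suggests, and reads the P4 plaque total as 1587, which is an assumption where the printed figure is unclear. Names are illustrative.

```python
ANCHOR = "CATG"   # 4 bp anchoring enzyme (NlaIII) site

# label: (9 bp tag, SAGE abundance in percent, positive plaques, plaques screened)
TABLE2 = {
    "P1": ("TCCTCAAAA", 1.7, 6, 388),
    "P2": ("TGCGAGACC", 1.1, 43, 3700),
    "P3": ("ACGCAGGGA", 0.6, 5, 2772),
    "P4": ("AATTGAAGA", 0.6, 6, 1587),   # plaque total assumed to be 1587
}


def probe_13mer(tag: str) -> str:
    """13 bp probe: anchoring site followed by the 9 bp tag."""
    return ANCHOR + tag


def hyb_percent(positive: int, screened: int) -> float:
    """Percent of plaques hybridizing to the probe, to one decimal place."""
    return round(100 * positive / screened, 1)


if __name__ == "__main__":
    for label, (tag, sage_pct, pos, total) in TABLE2.items():
        print(label, probe_13mer(tag), f"SAGE {sage_pct}%",
              f"13mer Hyb {hyb_percent(pos, total)}%")
```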

These results demonstrate that SAGE provides both quantitative
and qualitative data about gene expression. The use
of different anchoring enzymes and/or tagging enzymes with
various recognition elements lends great flexibility to this
strategy. In particular, since different anchoring enzymes
cleave cDNA at different sites, the use of at least 2 different
anchoring enzymes on different samples of the same cDNA
preparation allows confirmation of results and analysis of
sequences that might not contain a recognition site for one
of the enzymes.

As efforts to fully characterize the genome near
completion, SAGE should allow a direct readout of expression
in any given cell type or tissue. In the interim, a major
application of SAGE will be the comparison of gene expression
patterns among tissues and in various developmental
and disease states in a given cell or tissue. One of skill in the
art with the capability to perform PCR and manual sequencing
could perform SAGE for this purpose. Adaptation of this
technique to an automated sequencer would allow the analysis
of over 1,000 transcripts in a single 3 hour run. An ABI
377 sequencer can produce a 451 bp readout for 36 templates
in a 3 hour run (451 bp/11 bp per tag x 36 = 1,476 tags).
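
The throughput figure can be reproduced directly. The Python sketch below uses the numbers given in the text (451 bp per read, 11 bp per tag, 36 templates per run); `runs_needed` is a hypothetical helper added only to show how the per-run figure scales to a target number of tags.

```python
import math


def tags_per_run(read_len_bp: int = 451, bp_per_tag: int = 11,
                 templates: int = 36) -> int:
    """Whole tags readable per template, multiplied by templates per run."""
    return (read_len_bp // bp_per_tag) * templates


def runs_needed(target_tags: int, **run_params) -> int:
    """Number of runs required to reach a target tag count (hypothetical helper)."""
    return math.ceil(target_tags / tags_per_run(**run_params))


if __name__ == "__main__":
    print("tags per 3 hour run:", tags_per_run())
    print("runs for 10,000 tags:", runs_needed(10_000))
```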
The appropriate number of tags to be determined will
depend on the application. For example, the definition of
genes expressed at relatively high levels (0.5% or more) in
one tissue, but low in another, would require only a single
day. Determination of transcripts expressed at greater than
100 mRNA's per cell (0.025% or more) should be quantifiable
within a few months by a single investigator. Use of
two different Anchoring Enzymes will ensure that virtually
all transcripts of the desired abundance will be identified.
The genes encoding those tags found to be most interesting
on the basis of their differential representation can be
positively identified by a combination of data-base
searching, hybridization, and sequence analysis as demonstrated
in Table 2. Obviously, SAGE could also be applied
to the analysis of organisms other than humans, and could
direct investigation towards genes expressed in specific
biologic states.

SAGE, as described herein, allows comparison of expression
of numerous genes among tissues or among different
states of development of the same tissue, or between pathologic
tissue and its normal counterpart. Such analysis is
useful for identifying therapeutically, diagnostically and
prognostically relevant genes, for example. Among the
many utilities for SAGE technology is the identification of
appropriate antisense or triple helix reagents which may be
therapeutically useful. Further, gene therapy candidates can
also be identified by the SAGE technology. Other uses
include diagnostic applications for identification of individual
genes or groups of genes whose expression is shown
to correlate to predisposition to disease, the presence of
disease, and prognosis of disease, for example. An abundance
profile, such as that depicted in Table 1, is useful for
the above described applications. SAGE is also useful for
detection of an organism (e.g., a pathogen) in a host or
detection of infection-specific genes expressed by a pathogen
in a host.

The ability to identify a large number of expressed genes
in a short period of time, as described by SAGE in the
present invention, provides unlimited uses.

Although the invention has been described with reference
to the presently preferred embodiment, it should be understood
that various modifications can be made without
departing from the spirit of the invention. Accordingly, the
invention is limited only by the following claims.



SEQUENCE LISTING

(1) GENERAL INFORMATION:

    (iii) NUMBER OF SEQUENCES: 7

(2) INFORMATION FOR SEQ ID NO:1:

    (i) SEQUENCE CHARACTERISTICS:
        (A) LENGTH: 43 base pairs
        (B) TYPE: nucleic acid
        (C) STRANDEDNESS: both
        (D) TOPOLOGY: both

    (ii) MOLECULE TYPE: DNA (genomic)

    (xi) SEQUENCE DESCRIPTION: SEQ ID NO:1:

TTTTACCAGC TTATTCAATT CGGTCCTCTC GCACAGGGAC ATG    43

(2) INFORMATION FOR SEQ ID NO:2:

    (i) SEQUENCE CHARACTERISTICS:
        (A) LENGTH: 36 base pairs
        (B) TYPE: nucleic acid
        (C) STRANDEDNESS: both
        (D) TOPOLOGY: both

    (ii) MOLECULE TYPE: DNA (genomic)

    (xi) SEQUENCE DESCRIPTION: SEQ ID NO:2:

ATGGTCGAAT AAGTTAAGCC AGGAGAGCGT GTCCCT    36

(2) INFORMATION FOR SEQ ID NO:3:

    (i) SEQUENCE CHARACTERISTICS:
        (A) LENGTH: 44 base pairs
        (B) TYPE: nucleic acid
        (C) STRANDEDNESS: both
        (D) TOPOLOGY: both

    (ii) MOLECULE TYPE: DNA (genomic)

    (xi) SEQUENCE DESCRIPTION: SEQ ID NO:3:

TTTTTGTAGA CATTCTAGTA TCTCGTCAAG TCGGAAGGGA CATG    44

(2) INFORMATION FOR SEQ ID NO:4:

    (i) SEQUENCE CHARACTERISTICS:
        (A) LENGTH: 37 base pairs
        (B) TYPE: nucleic acid
        (C) STRANDEDNESS: both
        (D) TOPOLOGY: both

    (ii) MOLECULE TYPE: DNA (genomic)

    (xi) SEQUENCE DESCRIPTION: SEQ ID NO:4:

AACATCTGTA AGATCATAGA GCAGTTCAGC CTTCCCT    37

(2) INFORMATION FOR SEQ ID NO:5:

    (i) SEQUENCE CHARACTERISTICS:
        (A) LENGTH: 21 base pairs
        (B) TYPE: nucleic acid
        (C) STRANDEDNESS: both
        (D) TOPOLOGY: both

    (ii) MOLECULE TYPE: DNA (genomic)

    (xi) SEQUENCE DESCRIPTION: SEQ ID NO:5:

CCAGCTTATT CAATTCGGTC C    21

(2) INFORMATION FOR SEQ ID NO:6:

    (i) SEQUENCE CHARACTERISTICS:
        (A) LENGTH: 21 base pairs
        (B) TYPE: nucleic acid
        (C) STRANDEDNESS: both
        (D) TOPOLOGY: both

    (ii) MOLECULE TYPE: DNA (genomic)

    (xi) SEQUENCE DESCRIPTION: SEQ ID NO:6:

GTAGACATTC TAGTATCTCG T    21

(2) INFORMATION FOR SEQ ID NO:7:

    (i) SEQUENCE CHARACTERISTICS:
        (A) LENGTH: 26 base pairs
        (B) TYPE: nucleic acid
        (C) STRANDEDNESS: both
        (D) TOPOLOGY: both

    (ii) MOLECULE TYPE: DNA (genomic)

    (xi) SEQUENCE DESCRIPTION: SEQ ID NO:7:

GACGTCGACC TGAGGTAATT ATAACC    26



What is claimed is: 

1. An isolated oligonucleotide composition comprising at 
least one ditag, wherein the ditag comprises two covalently 
joined defined nucleotide sequence tags in opposite 
orientation, wherein each tag corresponds to at least one 
expressed gene. 

2. The composition of claim 1, wherein the oligonucle- 
otide consists of about 1 to 200 ditags. 

3. The composition of claim 2, wherein the oligonucle- 
otide consists of about 8 to 20 ditags. 

4. A method for the detection of gene expression com- 
prising: 

producing complementary deoxyribonucleic acid (cDNA) 
oligonucleotides; 



isolating a first defined nucleotide sequence tag from a 
first cDNA oligonucleotide and a second defined nucle- 
otide sequence tag from a second cDNA oligonucle- 
otide; 

linking the first tag to a first oligonucleotide linker thereby 
forming a first linked nucleic acid, wherein the first 
oligonucleotide linker comprises a first enzyme recog- 
nition site that allows DNA cleavage at a site in the first 
defined nucleotide sequence distant from the first rec- 
ognition site; 

linking the second tag to a second oligonucleotide linker 
thereby forming a second linked nucleic acid, wherein 
the second oligonucleotide linker comprises a second 
enzyme recognition site that allows DNA cleavage at a 
site in the second defined nucleotide sequence distant 
from the second recognition site;

cleaving the first and the second linked nucleic acids with
at least one enzyme that recognizes each of the recognition
sites;

ligating the first and second tags to form a ditag; and

determining the nucleotide sequence of at least one tag of
the ditag to detect gene expression.

5. The method of claim 4, wherein the first oligonucleotide
linker comprises a first amplification primer hybridization
sequence, and the second oligonucleotide linker
comprises a second amplification primer hybridization
sequence; and

further comprising amplifying the ditag oligonucleotide.

6. The method of claim 4, further comprising producing
concatemers of the ditag.

7. The method of claim 6, wherein the concatemer consists
of about 2 to 200 ditags.

8. The method of claim 7, wherein the concatemer consists
of about 8 to 20 ditags.

9. The method of claim 4, wherein the first and second
oligonucleotide linkers comprise the same nucleotide
sequence.

10. The method of claim 4, wherein the first and second
oligonucleotide linkers comprise different nucleotide
sequences.

11. The method of claim 10, wherein the first and second
oligonucleotide linkers have a sequence:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3' (SEQ ID NO:1)

3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5' (SEQ ID NO:2)

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3' (SEQ ID NO:3)

3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5', (SEQ ID NO:4)

wherein A is dideoxy A.

12. The method of claim 4, wherein at least one of the
enzyme recognition sites is a type IIS endonuclease
recognition site.

13. The method of claim 12, wherein the type IIS endonuclease
is selected from the group consisting of BsmFI and
FokI.

14. The method of claim 4, wherein the ditag is about 12
to 60 base pairs.

15. The method of claim 14, wherein the ditag is about 18
to 22 base pairs.

16. The method of claim [illegible], wherein the amplifying
is by polymerase chain reaction (PCR).

17. The method of claim 16, wherein primers for PCR are
selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3' (SEQ ID NO:5)
and

5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:6).

18. A method for detection of gene expression comprising:

cleaving a cDNA sample with a first restriction
endonuclease, wherein the endonuclease cleaves the
cDNA at a defined position in the cDNA thereby
producing defined sequence tags;

isolating the defined cDNA tags and forming a first pool
of tags;

ligating a first pool of tags with a first oligonucleotide
linker having a first enzyme recognition site that allows
DNA cleavage at a site distant from the second
recognition site;

ligating a second pool of tags with a second oligonucleotide
linker having a second enzyme recognition site
that allows DNA cleavage at a site distant from the
second recognition site;

cleaving the tags with a first and a second tag cleaving
restriction endonuclease, wherein the first tag-cleaving
restriction endonuclease recognizes a first enzyme
recognition site and cleaves at a site distant from the first
recognition site and wherein the second tag-cleaving
restriction endonuclease recognizes a second enzyme
recognition site and cleaves at a site distant from the
second recognition site;

ligating the two pools of tags to produce at least one ditag;
and

determining the nucleotide sequence of at least one ditag,
wherein the ditag(s) correspond to sequence from an
expressed gene.

19. The method of claim 18, further comprising amplifying
the ditag.

20. The method of claim 18, wherein the first restriction
endonuclease has at least one recognition site in the cDNA.

21. The method of claim 20, wherein the first restriction
enzyme has a four base pair recognition site.

22. The method of claim 21, wherein the restriction
endonuclease is NlaIII.

23. The method of claim 18, wherein the cDNA comprises
a means for capture.

24. The method of claim 23, wherein the means for
capture is a binding element.

25. The method of claim 24, wherein the binding element
is biotin.

26. The method of claim 18, wherein the first and second
oligonucleotide linkers comprise the same nucleotide
sequence.

27. The method of claim 18, wherein the first and second
oligonucleotide linkers comprise different nucleotide
sequences.

28. The method of claim 27, wherein the first and second
oligonucleotide linkers have a sequence:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3' (SEQ ID NO:1)

3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5' (SEQ ID NO:2)

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3' (SEQ ID NO:3)

3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5' (SEQ ID NO:4)

wherein A is dideoxy A.

29. The method of claim 18, wherein at least one of the
enzyme recognition sites is a type IIS endonuclease
recognition site.

30. The method of claim 29, wherein the type IIS endonuclease
is selected from the group consisting of BsmFI and
FokI.

31. The method of claim 18, wherein the ditag is about 12
to 60 base pairs.

32. The method of claim 31, wherein the ditag is about 14
to 22 base pairs.

33. The method of claim 18, further comprising ligating
the ditags to produce a concatemer.

34. The method of claim 33, wherein the concatemer
consists of about 2 to 200 ditags.

35. The method of claim 34, wherein the concatemer
consists of about 8 to 20 ditags.

36. The method of claim 18, wherein the amplifying is by
polymerase chain reaction (PCR).

37. The method of claim 36, wherein primers for PCR are
selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3' (SEQ ID NO:5)
and

5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:6).

38. A kit useful for detection of gene expression wherein
the presence of a cDNA ditag is indicative of expression of
a gene having a sequence of a tag of the ditag, the kit
comprising one or more containers comprising a first container
containing a first oligonucleotide linker having a first
sequence useful for hybridization of an amplification primer; a
second container containing a second oligonucleotide linker
having a second sequence useful for hybridization of an
amplification primer, wherein the linkers further comprise a
restriction endonuclease site for cleavage of DNA at a site
distant from the restriction endonuclease recognition site;
and a third and fourth container having nucleic acid primers
for hybridization to the first and second unique sequences of
the linker.

39. The kit of claim 38, wherein the linkers have a
sequence

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3' (SEQ ID NO:1)

3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5' (SEQ ID NO:2)

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3' (SEQ ID NO:3)

3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5', (SEQ ID NO:4)

wherein A is dideoxy A.

40. The kit of claim 38, wherein the restriction endonuclease
is a type IIS endonuclease.

41. The kit of claim 40, wherein the type IIS endonuclease
is BsmFI.

42. The kit of claim 38, wherein the primers for amplification
are selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3' (SEQ ID NO:5)
and

5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:6).

43. The method of claim 18, wherein the first oligonucleotide
linker comprises a first amplification primer hybridization
sequence and the second oligonucleotide linker comprises
a second amplification primer hybridization sequence.


